This is part of a bigger study you can find here
The expression data is provided as counts in a tab separate value format. In another file there is the phenoData associated to the samples.
count <- read.delim("../data/ALD_ramon/refseq_counts_ALD.tsv", row.names = 1)
phenoData <- read.csv("../data/ALD_ramon/NGS_AHsteps.design.csv", row.names = 1)
First we will explore a little bit the data, in case we need to correct something:
library("geneplotter")
summary(phenoData)
## Group Sample.id L G
## E :12 INTEAM_A01: 1 E :12 Early.ASH :12
## C :11 INTEAM_A02: 1 C :11 AH.Non.severe :11
## D :10 INTEAM_A03: 1 D :10 Explants.from.AH :10
## I :10 INTEAM_A04: 1 I :10 Normal.livers :10
## A : 9 INTEAM_A05: 1 A : 9 AH.Severe.-.non-responders: 9
## B : 9 INTEAM_A06: 1 B : 9 AH.Severe.-.responders : 9
## (Other):27 (Other) :82 (Other):27 (Other) :27
phenoData
## Group Sample.id L G
## A01 A INTEAM_A01 A AH.Severe.-.responders
## A02 A INTEAM_A02 A AH.Severe.-.responders
## A03 A INTEAM_A03 A AH.Severe.-.responders
## A04 A INTEAM_A04 A AH.Severe.-.responders
## A05 A INTEAM_A05 A AH.Severe.-.responders
## A06 A INTEAM_A06 A AH.Severe.-.responders
## A07 A INTEAM_A07 A AH.Severe.-.responders
## A08 A INTEAM_A08 A AH.Severe.-.responders
## A09 A INTEAM_A09 A AH.Severe.-.responders
## B10 B INTEAM_B10 B AH.Severe.-.non-responders
## B11 B INTEAM_B11 B AH.Severe.-.non-responders
## B12 B INTEAM_B12 B AH.Severe.-.non-responders
## B13 B INTEAM_B13 B AH.Severe.-.non-responders
## B14 B INTEAM_B14 B AH.Severe.-.non-responders
## B15 B INTEAM_B15 B AH.Severe.-.non-responders
## B16 B INTEAM_B16 B AH.Severe.-.non-responders
## B17 B INTEAM_B17 B AH.Severe.-.non-responders
## B18 B INTEAM_B18 B AH.Severe.-.non-responders
## C19 C INTEAM_C19 C AH.Non.severe
## C20 C INTEAM_C20 C AH.Non.severe
## C21 C INTEAM_C21 C AH.Non.severe
## C22 C INTEAM_C22 C AH.Non.severe
## C23 C INTEAM_C23 C AH.Non.severe
## C24 C INTEAM_C24 C AH.Non.severe
## C25 C INTEAM_C25 C AH.Non.severe
## C26 C INTEAM_C26 C AH.Non.severe
## C27 C INTEAM_C27 C AH.Non.severe
## C28 C INTEAM_C28 C AH.Non.severe
## C29 C INTEAM_C29 C AH.Non.severe
## D30 D INTEAM_D30 D Explants.from.AH
## D31 D INTEAM_D31 D Explants.from.AH
## D32 D INTEAM_D32 D Explants.from.AH
## D34 D INTEAM_D34 D Explants.from.AH
## D35 D INTEAM_D35 D Explants.from.AH
## D36 D INTEAM_D36 D Explants.from.AH
## D37 D INTEAM_D37 D Explants.from.AH
## D38 D INTEAM_D38 D Explants.from.AH
## D39 D INTEAM_D39 D Explants.from.AH
## D40 D INTEAM_D40 D Explants.from.AH
## E41 E INTEAM_E41 E Early.ASH
## E42 E INTEAM_E42 E Early.ASH
## E43 E INTEAM_E43 E Early.ASH
## E44 E INTEAM_E44 E Early.ASH
## E45 E INTEAM_E45 E Early.ASH
## E46 E INTEAM_E46 E Early.ASH
## E47 E INTEAM_E47 E Early.ASH
## E48 E INTEAM_E48 E Early.ASH
## E49 E INTEAM_E49 E Early.ASH
## E50 E INTEAM_E50 E Early.ASH
## E51 E INTEAM_E51 E Early.ASH
## E52 E INTEAM_E52 E Early.ASH
## F53 F INTEAM_F53 F Hepatitis.C
## F55 F INTEAM_F55 F Hepatitis.C
## F56 F INTEAM_F56 F Hepatitis.C
## F57 F INTEAM_F57 F Hepatitis.C
## F58 F INTEAM_F58 F Hepatitis.C
## F59 F INTEAM_F59 F Hepatitis.C
## F60 F INTEAM_F60 F Hepatitis.C
## F61 F INTEAM_F61 F Hepatitis.C
## F62 F INTEAM_F62 F Hepatitis.C
## G63 G INTEAM_G63 G NASH
## G64 G INTEAM_G64 G NASH
## G65 G INTEAM_G65 G NASH
## G66 G INTEAM_G66 G NASH
## G67 G INTEAM_G67 G NASH
## G68 G INTEAM_G68 G NASH
## G69 G INTEAM_G69 G NASH
## G70 G INTEAM_G70 G NASH
## G71 G INTEAM_G71 G NASH
## H72 H INTEAM_H72 H Compensated.cirrhosis
## H73 H INTEAM_H73 H Compensated.cirrhosis
## H74 H INTEAM_H74 H Compensated.cirrhosis
## H75 H INTEAM_H75 H Compensated.cirrhosis
## H76 H INTEAM_H76 H Compensated.cirrhosis
## H77 H INTEAM_H77 H Compensated.cirrhosis
## H78 H INTEAM_H78 H Compensated.cirrhosis
## H79 H INTEAM_H79 H Compensated.cirrhosis
## H80 H INTEAM_H80 H Compensated.cirrhosis
## I89 I INTEAM_I89 I Normal.livers
## I90 I INTEAM_I90 I Normal.livers
## I91 I INTEAM_I91 I Normal.livers
## I92 I INTEAM_I92 I Normal.livers
## I93 I INTEAM_I93 I Normal.livers
## I94 I INTEAM_I94 I Normal.livers
## I95 I INTEAM_I95 I Normal.livers
## I96 I INTEAM_I96 I Normal.livers
## I97 I INTEAM_I97 I Normal.livers
## I98 I INTEAM_I98 I Normal.livers
count[1:5, 1:5]
## A01 A02 A03 A04 A05
## DDX11L1 3 8 0 2 2
## WASH7P 43 82 104 66 82
## MIR6859-3 0 0 0 0 0
## MIR6859-4 0 0 0 0 0
## MIR6859-1 0 0 0 0 0
So we will delete two colums that has the same information as the column G. And we will use more easily readable names:
phenoData <- phenoData[, c("Sample.id", "G")]
colnames(phenoData) <- c("Sample", "Status")
levels(phenoData$Status) <- c("AH", "Non-responders", "Responders", "C.Comp",
"ASH", "Explants", "HCV", "NASH", "Normal")
phenoData
## Sample Status
## A01 INTEAM_A01 Responders
## A02 INTEAM_A02 Responders
## A03 INTEAM_A03 Responders
## A04 INTEAM_A04 Responders
## A05 INTEAM_A05 Responders
## A06 INTEAM_A06 Responders
## A07 INTEAM_A07 Responders
## A08 INTEAM_A08 Responders
## A09 INTEAM_A09 Responders
## B10 INTEAM_B10 Non-responders
## B11 INTEAM_B11 Non-responders
## B12 INTEAM_B12 Non-responders
## B13 INTEAM_B13 Non-responders
## B14 INTEAM_B14 Non-responders
## B15 INTEAM_B15 Non-responders
## B16 INTEAM_B16 Non-responders
## B17 INTEAM_B17 Non-responders
## B18 INTEAM_B18 Non-responders
## C19 INTEAM_C19 AH
## C20 INTEAM_C20 AH
## C21 INTEAM_C21 AH
## C22 INTEAM_C22 AH
## C23 INTEAM_C23 AH
## C24 INTEAM_C24 AH
## C25 INTEAM_C25 AH
## C26 INTEAM_C26 AH
## C27 INTEAM_C27 AH
## C28 INTEAM_C28 AH
## C29 INTEAM_C29 AH
## D30 INTEAM_D30 Explants
## D31 INTEAM_D31 Explants
## D32 INTEAM_D32 Explants
## D34 INTEAM_D34 Explants
## D35 INTEAM_D35 Explants
## D36 INTEAM_D36 Explants
## D37 INTEAM_D37 Explants
## D38 INTEAM_D38 Explants
## D39 INTEAM_D39 Explants
## D40 INTEAM_D40 Explants
## E41 INTEAM_E41 ASH
## E42 INTEAM_E42 ASH
## E43 INTEAM_E43 ASH
## E44 INTEAM_E44 ASH
## E45 INTEAM_E45 ASH
## E46 INTEAM_E46 ASH
## E47 INTEAM_E47 ASH
## E48 INTEAM_E48 ASH
## E49 INTEAM_E49 ASH
## E50 INTEAM_E50 ASH
## E51 INTEAM_E51 ASH
## E52 INTEAM_E52 ASH
## F53 INTEAM_F53 HCV
## F55 INTEAM_F55 HCV
## F56 INTEAM_F56 HCV
## F57 INTEAM_F57 HCV
## F58 INTEAM_F58 HCV
## F59 INTEAM_F59 HCV
## F60 INTEAM_F60 HCV
## F61 INTEAM_F61 HCV
## F62 INTEAM_F62 HCV
## G63 INTEAM_G63 NASH
## G64 INTEAM_G64 NASH
## G65 INTEAM_G65 NASH
## G66 INTEAM_G66 NASH
## G67 INTEAM_G67 NASH
## G68 INTEAM_G68 NASH
## G69 INTEAM_G69 NASH
## G70 INTEAM_G70 NASH
## G71 INTEAM_G71 NASH
## H72 INTEAM_H72 C.Comp
## H73 INTEAM_H73 C.Comp
## H74 INTEAM_H74 C.Comp
## H75 INTEAM_H75 C.Comp
## H76 INTEAM_H76 C.Comp
## H77 INTEAM_H77 C.Comp
## H78 INTEAM_H78 C.Comp
## H79 INTEAM_H79 C.Comp
## H80 INTEAM_H80 C.Comp
## I89 INTEAM_I89 Normal
## I90 INTEAM_I90 Normal
## I91 INTEAM_I91 Normal
## I92 INTEAM_I92 Normal
## I93 INTEAM_I93 Normal
## I94 INTEAM_I94 Normal
## I95 INTEAM_I95 Normal
## I96 INTEAM_I96 Normal
## I97 INTEAM_I97 Normal
## I98 INTEAM_I98 Normal
However we are only interested on those related to alcohol disease, so we need to subset them:
alcohol <- !phenoData$Status %in% c("HCV", "NASH", "Explants")
phenoData <- phenoData[alcohol, ]
count <- count[, colnames(count) %in% rownames(phenoData)]
We can calculate the normalization factors used to calculate the Counts per million read:
library("edgeR")
dge <- DGEList(counts = count, group = phenoData$Status)
ord <- order(dge$sample$lib.size)
barplot(dge$sample$lib.size[ord]/1e6, las = 1, ylab = "Millions of reads",
xlab = "Samples", main = "Library size of the samples",
col = droplevels(phenoData$Status))
legend("topleft", legend = levels(droplevels(phenoData$Status)),
fill = seq_along(levels(droplevels(phenoData$Status))))
dge <- calcNormFactors(dge)
We can normalize the data to counts per millon of reads taking into account that we know the normalized library size:
expression <- cpm(dge, normalized.lib.sizes = TRUE, log = TRUE,
prior.count = 3)
We can observe the quality of the samples with a boxplot and the density of expressions:
boxplot(expression, main = "Expression per sample",
col = droplevels(phenoData$Status))
multidensity(expression, main = "Densities of expression", legend = NULL)
avgexp <- rowMeans(expression)
hist(avgexp, main = "Histogram of mean expression per gene")
abline(v = 1, col = "red", lwd = 2)
There is a lot of differences between samples on the low expressed genes. We filter those genes that are below 1, as they are unreliable:
expression <- expression[avgexp > 1, ]
multidensity(expression, main = "Densities of expression", legend = NULL)
Now the distribution of the expression between samples is much more comparable. Following the recomendation of the edgeR package we recalculate the normalization factor without the lowly expressed genes:
dge <- calcNormFactors(dge[avgexp > 1, ])
expression <- cpm(dge, normalized.lib.sizes = TRUE, log = TRUE,
prior.count = 3)
We can check how has this normalization affected the values:
boxplot(expression, main = "Expression per sample", col =
droplevels(phenoData$Status))
multidensity(expression, main = "Densities of expression", legend = NULL)
avgexp <- rowMeans(expression)
hist(avgexp, main = "Histogram of mean expression per gene")
Now are a little bit more comparable.
To work easier with the data I proceed to create an ExpressionSet
library("Biobase")
library("AnnotationDbi")
ALD <- ExpressionSet(as.matrix(expression))
phenoData(ALD) <- AnnotatedDataFrame(droplevels(phenoData))
ALD
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 13366 features, 60 samples
## element names: exprs
## protocolData: none
## phenoData
## rowNames: A01 A02 ... I98 (60 total)
## varLabels: Sample Status
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
We take advantatge of the plotMDS function of limma package to plot the MDS of the samples, to see how close are between them.
plotMDS(ALD, top = Inf, labels = phenoData(ALD)$Status, main = "Samples relationships")
We can see that they are clearly separated in two groups, but some samples like the Alcohol Hepatitis samples are on both sites of the first dimension. We can also observe that the cirrhosis compensated samples are closer to the normal or healthy samples.
We are interested in the progression of the disease, both how does the expression of genes change in each step, and the overall change that leads the disease.
The design we will use to find the differentially expressed genes in each step is the following:
# To ensure sintactically valid names
names <- make.names(levels(phenoData(ALD)$Status))
design <- sapply(names, function(x){
x <- ifelse(x == "Non.responders", "Non-responders", x)
ifelse(phenoData(ALD)$Status == x, 1, 0)
})
rownames(design) <- rownames(phenoData(ALD))
design
## AH Non.responders Responders C.Comp ASH Normal
## A01 0 0 1 0 0 0
## A02 0 0 1 0 0 0
## A03 0 0 1 0 0 0
## A04 0 0 1 0 0 0
## A05 0 0 1 0 0 0
## A06 0 0 1 0 0 0
## A07 0 0 1 0 0 0
## A08 0 0 1 0 0 0
## A09 0 0 1 0 0 0
## B10 0 1 0 0 0 0
## B11 0 1 0 0 0 0
## B12 0 1 0 0 0 0
## B13 0 1 0 0 0 0
## B14 0 1 0 0 0 0
## B15 0 1 0 0 0 0
## B16 0 1 0 0 0 0
## B17 0 1 0 0 0 0
## B18 0 1 0 0 0 0
## C19 1 0 0 0 0 0
## C20 1 0 0 0 0 0
## C21 1 0 0 0 0 0
## C22 1 0 0 0 0 0
## C23 1 0 0 0 0 0
## C24 1 0 0 0 0 0
## C25 1 0 0 0 0 0
## C26 1 0 0 0 0 0
## C27 1 0 0 0 0 0
## C28 1 0 0 0 0 0
## C29 1 0 0 0 0 0
## E41 0 0 0 0 1 0
## E42 0 0 0 0 1 0
## E43 0 0 0 0 1 0
## E44 0 0 0 0 1 0
## E45 0 0 0 0 1 0
## E46 0 0 0 0 1 0
## E47 0 0 0 0 1 0
## E48 0 0 0 0 1 0
## E49 0 0 0 0 1 0
## E50 0 0 0 0 1 0
## E51 0 0 0 0 1 0
## E52 0 0 0 0 1 0
## H72 0 0 0 1 0 0
## H73 0 0 0 1 0 0
## H74 0 0 0 1 0 0
## H75 0 0 0 1 0 0
## H76 0 0 0 1 0 0
## H77 0 0 0 1 0 0
## H78 0 0 0 1 0 0
## H79 0 0 0 1 0 0
## H80 0 0 0 1 0 0
## I89 0 0 0 0 0 1
## I90 0 0 0 0 0 1
## I91 0 0 0 0 0 1
## I92 0 0 0 0 0 1
## I93 0 0 0 0 0 1
## I94 0 0 0 0 0 1
## I95 0 0 0 0 0 1
## I96 0 0 0 0 0 1
## I97 0 0 0 0 0 1
## I98 0 0 0 0 0 1
As you can see each group has its own column to increase the power to the model, and to allow for more informative contrasts. To see the overall change of the disease we can use only one column. We are not interested in other phenotipic change, so how the patient responds to drug is not interesting to us, and it is considered as a single category. In the previous design we will check that Responders and Non-responders are almost equal in gene expression.
refact <- function(x) {
if (x == "Normal") {
0
} else if ( x == "ASH") {
1
} else if (x == "C.Comp") {
2
} else if (x == "AH") {
4
} else if (x == "Responders") {
5
} else if (x == "Non-responders") {
5
} else {
stop("Unexpected level")
}
}
design2 <- sapply(as.character(phenoData(ALD)$Status), refact)
design2 <- as.matrix(design2)
design2 <- cbind(rep(1, ncol(design2)), design2)
colnames(design2) <- c("(Intercept)", "Progression")
rownames(design2) <- rownames(phenoData(ALD))
design2
## (Intercept) Progression
## A01 1 5
## A02 1 5
## A03 1 5
## A04 1 5
## A05 1 5
## A06 1 5
## A07 1 5
## A08 1 5
## A09 1 5
## B10 1 5
## B11 1 5
## B12 1 5
## B13 1 5
## B14 1 5
## B15 1 5
## B16 1 5
## B17 1 5
## B18 1 5
## C19 1 4
## C20 1 4
## C21 1 4
## C22 1 4
## C23 1 4
## C24 1 4
## C25 1 4
## C26 1 4
## C27 1 4
## C28 1 4
## C29 1 4
## E41 1 1
## E42 1 1
## E43 1 1
## E44 1 1
## E45 1 1
## E46 1 1
## E47 1 1
## E48 1 1
## E49 1 1
## E50 1 1
## E51 1 1
## E52 1 1
## H72 1 2
## H73 1 2
## H74 1 2
## H75 1 2
## H76 1 2
## H77 1 2
## H78 1 2
## H79 1 2
## H80 1 2
## I89 1 0
## I90 1 0
## I91 1 0
## I92 1 0
## I93 1 0
## I94 1 0
## I95 1 0
## I96 1 0
## I97 1 0
## I98 1 0
Normalize the expression by calculating an appropriate observation weight.
v1 <- voom(dge, design, plot = TRUE, normalize.method = "cyclicloess")
v2 <- voom(dge, design2, plot = TRUE, normalize.method = "cyclicloess")
Note that even if we filter those with low expression we assign the correct library size, to avoid over estimation.
We store the normalized expression and the pehnotypic data for uses on the network construction step.
save(v1, v2, phenoData, file = "Expression_ALD.RData")
We can observe how good they behave
library("sva")
mod0 <- matrix(1, nrow = nrow(design))
pValues <- f.pvalue(v1$E, design, mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")
pValues <- f.pvalue(v2$E, design2, mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")
We can further estimate the number of surrogate variables and the surrogate variables with the package sva:
# Our null model is that all samples express the same
sv1 <- sva(v1$E, design, mod0)
## Number of significant surrogate variables is: 9
## Iteration (out of 5 ):1 2 3 4 5
head(sv1$sv)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.005481518 -0.141434253 0.1587707693 -0.06776785 0.01947042
## [2,] -0.094224387 0.013050722 0.1739487832 0.18262989 0.09821819
## [3,] 0.213856473 0.073508014 -0.0910537884 0.05816032 -0.15425493
## [4,] -0.106717321 0.005084984 0.0003759116 -0.02016505 -0.20718822
## [5,] 0.245334581 0.084746576 0.0853614231 -0.07844789 -0.01199353
## [6,] -0.194482272 0.216002809 0.1049633258 -0.20552222 -0.03118649
## [,6] [,7] [,8] [,9]
## [1,] -0.12280680 0.09382878 -0.13375748 0.018234721
## [2,] -0.12500493 0.02917357 0.31395361 -0.104017796
## [3,] -0.06398647 0.05341420 0.01274342 -0.079779581
## [4,] -0.10887725 0.03853589 0.09170448 -0.005219599
## [5,] 0.13953441 -0.06806961 0.27700219 -0.123476931
## [6,] 0.22603306 -0.05431003 0.07015424 0.149221537
sv2 <- sva(v2$E, design2, mod0)
## Number of significant surrogate variables is: 9
## Iteration (out of 5 ):1 2 3 4 5
head(sv2$sv)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.08116204 -0.132977734 0.06674957 -0.07416035 0.01064854
## [2,] -0.06633221 -0.099896707 0.05749157 0.10956878 -0.12377807
## [3,] 0.06428549 0.022845989 -0.16266685 0.04152445 0.18203752
## [4,] -0.05189496 0.006240558 0.07228886 -0.03218212 0.13326773
## [5,] 0.06107009 -0.060973092 -0.17842932 0.02615797 0.04503047
## [6,] -0.22991834 -0.024717621 0.01344964 -0.14268577 -0.12630776
## [,6] [,7] [,8] [,9]
## [1,] -0.006770487 0.091989131 -0.011010203 0.08789165
## [2,] 0.055690674 0.108265559 0.269158006 0.01913017
## [3,] -0.026099900 -0.043709568 0.004741666 -0.03763857
## [4,] -0.123186222 0.093950042 0.045071074 -0.02029857
## [5,] 0.053279584 -0.009233111 0.157958205 0.11013432
## [6,] -0.161235760 0.051120619 0.063993241 -0.11412902
We can append the estimated surrogated variables to the design matrix for a better estimation of the effect of each paramter:
i.design <- cbind(design, sv1$sv)
colnames(i.design) <- c(colnames(design), paste0("X",
1:ncol(sv1$sv)))
i.design2 <- cbind(design2, sv2$sv)
colnames(i.design2) <- c(colnames(design2),
paste0("X", 1:ncol(sv2$sv)))
i.mod0 <- cbind(mod0, sv2$sv)
colnames(i.design2) <- c(colnames(design2),
paste0("X", 1:ncol(sv2$sv)))
We can observe how this adjustment has helped to uniform the p-values. As we would expect a uniform probability of each gene being differentially expressed.
pValues <- f.pvalue(v1$E, i.design, i.mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")
pValues <- f.pvalue(v2$E, i.design2, i.mod0)
qValues <- p.adjust(pValues, method = "BH")
hist(qValues, sub = "Using surrogate variables")
As we can see the surrogate variables, don’t help to improve the data ajustment, because the distribution of the qValues is not uniform. We fit each design with the expression of the alcoholic liver disease:
fit <- lmFit(v1$E, design)
fit.2 <- lmFit(v2$E, design2)
comparisons, in the first design:
contrasts0 <- makeContrasts(
"ASHvsNormal" = ASH - Normal,
"CirrhosisVsNormal" = C.Comp - Normal,
"AHvsNormal" = AH - Normal,
"RespondersVsNormal" = Responders - Normal,
"Non.respondersVsNormal" = Non.responders - Normal,
# Assuming that some status are similar
"SevereVsNormal" = (Responders + Non.responders)/2 - Normal,
"RespondersVsNon.responders" = Responders - Non.responders,
"CirrhosisVsASH" = C.Comp - ASH,
"AHvsCirrhosis" = AH - C.Comp,
"Non.respondersVsAH" = Non.responders - AH,
"RespondersVsAH" = Responders - AH,
"SevereVsHepatitis" = (Non.responders + Responders)/2 - AH,
levels = design)
head(contrasts0)
## Contrasts
## Levels ASHvsNormal CirrhosisVsNormal AHvsNormal
## AH 0 0 1
## Non.responders 0 0 0
## Responders 0 0 0
## C.Comp 0 1 0
## ASH 1 0 0
## Normal -1 -1 -1
## Contrasts
## Levels RespondersVsNormal Non.respondersVsNormal SevereVsNormal
## AH 0 0 0.0
## Non.responders 0 1 0.5
## Responders 1 0 0.5
## C.Comp 0 0 0.0
## ASH 0 0 0.0
## Normal -1 -1 -1.0
## Contrasts
## Levels RespondersVsNon.responders CirrhosisVsASH AHvsCirrhosis
## AH 0 0 1
## Non.responders -1 0 0
## Responders 1 0 0
## C.Comp 0 1 -1
## ASH 0 -1 0
## Normal 0 0 0
## Contrasts
## Levels Non.respondersVsAH RespondersVsAH SevereVsHepatitis
## AH -1 -1 -1.0
## Non.responders 1 0 0.5
## Responders 0 1 0.5
## C.Comp 0 0 0.0
## ASH 0 0 0.0
## Normal 0 0 0.0
Given those comparisons of interest we now evaluate the results:
fit2 <- contrasts.fit(fit, contrasts0)
fit2 <- eBayes(fit2)
results <- decideTests(fit2, adjust.method = "BH", lfc = log2(2))
summary(results)
## ASHvsNormal CirrhosisVsNormal AHvsNormal RespondersVsNormal
## -1 426 498 748 1224
## 0 12570 12343 11767 10776
## 1 370 525 851 1366
## Non.respondersVsNormal SevereVsNormal RespondersVsNon.responders
## -1 1351 1281 0
## 0 10619 10728 13366
## 1 1396 1357 0
## CirrhosisVsASH AHvsCirrhosis Non.respondersVsAH RespondersVsAH
## -1 189 512 194 149
## 0 12888 12258 13123 13098
## 1 289 596 49 119
## SevereVsHepatitis
## -1 195
## 0 13058
## 1 113
fit2.2 <- eBayes(fit.2)
results2 <- decideTests(fit2.2, adjust.method = "BH", lfc = log2(2))
summary(results2)
## (Intercept) Progression
## -1 10 13
## 0 505 13344
## 1 12851 9
We can observe the quality of the t-values over the teoretical quantiles to observe if there is any assumption about the eBayes fitting which doesn’t holds. Note that I already modified the expected proportion of genes in the call to eBayes:
par(mfrow = c(ncol(results), 3))
out <- sapply(colnames(results), function(x){
qqt(fit2$t[, x], df = fit2$df.total, main = paste("Student's t Q-Q Plot of", x))
abline(0, 1)
volcanoplot(fit2, coef = x, main = x)
plotMD(fit2, coef = x, main = x)
})
We can visualize the same for the other model we have:
par(mfrow = c(1, 3))
qqt(fit2.2$t[, "Progression"], df = fit2.2$df.total,
main = "Student's t Q-Q Plot of Progression")
abline(0, 1)
volcanoplot(fit2.2, coef = "Progression", main = "Progression")
plotMD(fit2.2, coef = "Progression", main = "Progression")
Here we can see on the last plot that the expression of the genes show a bimodal distribution in the progression of the dissease. Which seems to confirm that at one point it descompensates and has a very bad prognossis.
par(mfrow = c(1, 2))
plotSA(fit2, main = "First model: Mean−variance trend")
plotSA(fit2.2, main = "Progression model: Mean−variance trend")
It is a good idea to store the data, not only the program of your analysis, so here I go:
save(fit, fit2, fit2.2, fit.2, design, design2, dge, ALD, contrasts0, v1, v2,
file = "ALD.RData")
We can plot the top 2000 genes with the highest absolute value of log fold-change in each contrast just for informative purposes:
par(mfrow = c(ncol(contrasts0), 2))
out <- sapply(colnames(contrasts0), function(x) {
tt.ALD <- topTable(fit2, coef = x, sort.by = "logFC", number = Inf)
tt.ALD <- tt.ALD[order(-abs(tt.ALD$logFC)), ]
plot(density(tt.ALD[1:2000, "logFC"]), main = paste("Density of", x))
hist(tt.ALD[1:2000, "logFC"], main = paste("Histogram of", x), xlab = "logFC")
})
## Progression For the progression model we can plot it too those top 2000 significative at the threshold of 0.05:
par(mfrow = c(1, 2))
tt.ALD2 <- topTable(fit2.2, coef = "Progression", sort.by = "logFC",
number = Inf)
tt.ALD2 <- tt.ALD2[order(-abs(tt.ALD2$logFC)), ]
signif <- tt.ALD2[tt.ALD2$adj.P.Val < 0.05, ] # Subset of significant p-value
plot(density(signif[1:2000, "logFC"]), main = "Density")
hist(signif[1:2000, "logFC"], main = "Histogram", xlab = "logFC")
In order to store them we proceed with:
write.csv(tt.ALD2, file = "ALD_Progression.csv", na = "", row.names = FALSE)
We stored all the table with 13366 genes, even though some are not significant.
These are the genes relevant of the alocholic liver disease progression. There are thus 1034 genes overexpressed in alcoholic liver disease and 966 downregulated in the alcoholic liver disease, which are the main contributers to the disease progression. However the most significant way to see if there is a progression is using another thest the Kendall test: ### Kendall test This test check if a features has a constant grow or decrease among the samples, it is similar to a correlation test.
library("Kendall")
constant_change <- lapply(rownames(v2$E), function(x) {
Kendall(v2$E[x, ], design2[, "Progression"])})
names(constant_change) <- rownames(v2$E)
head(constant_change)
## $WASH7P
## tau = -0.0209, 2-sided pvalue =0.83392
##
## $MIR6723
## tau = -0.272, 2-sided pvalue =0.0048395
##
## $LOC100288069
## tau = -0.0663, 2-sided pvalue =0.49559
##
## $LINC01128
## tau = -0.48, 2-sided pvalue =6.8098e-07
##
## $SAMD11
## tau = 0.353, 2-sided pvalue =0.00025594
##
## $NOC2L
## tau = 0.223, 2-sided pvalue =0.021086
# Extract the appropiate values
kendall_tau <- sapply(constant_change, getElement, name = "tau")
kendall_pvalue <- sapply(constant_change, getElement, name = "sl")
# Store the names of the genes with a constant progression
genes_progression <- names(kendall_tau[kendall_pvalue < 0.05])
up <- names(kendall_tau[kendall_pvalue < 0.05 & kendall_tau > 0L])
down <- names(kendall_tau[kendall_pvalue < 0.05 & kendall_tau < 0L])
# Plot the histogram
hist(kendall_tau[kendall_pvalue < 0.05], xlab = "tau",
main = "Histogram of Kendall taus",
sub = "Significant with a 0.05 threshold")
Those genes show a trend to increase or decrease with the progression of the disease.
We can also see which genes are differentially expressed in each phase, and which are shared:
vennDiagram(results[, c(1:3, 6)], include = "up",
circle.col = c("red", "green", "black", "blue"),
main = "Genes Up-regulated in the disease")
vennDiagram(results[, c(1:3, 6)], include = "down",
circle.col = c("red", "green", "black", "blue"),
main = "Genes Down-regulated in the disease")
We can see that there are 299 genes which are differencially expressed significantly with an adjusted p-value below 0.05.
But we can plot it sepparatedly for each step, for a better comparison:
vennDiagram(results[, c(1, 2)], include = "up",
circle.col = c("green", "black"),
main = "Genes Up-regulated shared")
vennDiagram(results[, c(1, 2)], include = "down",
circle.col = c("green", "black"),
main = "Genes Down-regulated")
vennDiagram(results[, c(2, 3)], include = "up",
circle.col = c("green", "black"),
main = "Genes Up-regulated shared")
vennDiagram(results[, c(2, 3)], include = "down",
circle.col = c("green", "black"),
main = "Genes Down-regulated")
vennDiagram(results[, c(3, 6)], include = "up",
circle.col = c("green", "black"),
main = "Genes Up-regulated shared")
vennDiagram(results[, c(3, 6)], include = "down",
circle.col = c("green", "black"),
main = "Genes Down-regulated")
We can check for functional enrichment, to see if those few significant genes are more related to performing certain functions and processes. I check for pathways using the Reactome data base and the process using Gene Ontologies.
We can see in Reactome which pathways are enriched for those genes, that are significantly differentially expressed.
library("ReactomePA")
## Loading required package: DOSE
## DOSE v3.0.9 For help: https://guangchuangyu.github.io/DOSE
##
## If you use DOSE in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
##
## Attaching package: 'DOSE'
## The following object is masked from 'package:lattice':
##
## dotplot
## ReactomePA v1.18.1 For help: https://guangchuangyu.github.io/ReactomePA
##
## If you use ReactomePA in published research, please cite:
## Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
library("clusterProfiler")
## clusterProfiler v3.2.10 For help: https://guangchuangyu.github.io/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu., Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
DEG_up <- mapIds(org.Hs.eg.db, keys = up, keytype = "SYMBOL",
column = "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
DEG_down <- mapIds(org.Hs.eg.db, keys = down, keytype = "SYMBOL",
column = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
groups <- list(Up = DEG_up, Down = DEG_down)
cc <- compareCluster(groups, fun = "enrichPathway")
dotplot(cc)
We can observe differences in the comparison between genes up-regulated and dow-regulated. But for a more general overview is better to see all the genes involved
DEG_names <- mapIds(org.Hs.eg.db, keys = genes_progression,
keytype = "SYMBOL", column = "ENTREZID")
## 'select()' returned 1:many mapping between keys and columns
enrich <- enrichPathway(DEG_names, minGSSize = 2, maxGSSize = 2000)
write.csv(as.data.frame(enrich), file = "Reactome_DR.csv",
row.names = FALSE, na = "")
dotplot(enrich)
enrichMap(enrich, layout = igraph::layout_nicely,
vertex.label.cex = 1, n = 15)
Now we can observe which pathways are enriched with those genes, there is a functional difference between the genes up and down regulated.
We can observe in which biological process they are using topGO:
library("topGO")
topDiffGenes <- function(x) {
return(x <= 0.05)
}
GOdata.bp <- new("topGOdata",
ontology = "BP",
description = "Biological process of the signature module.",
allGenes = kendall_pvalue,
annot = annFUN.org,
ID = "symbol",
mapping = "org.Hs.eg",
geneSel = topDiffGenes,
nodeSize = 5)
##
## Building most specific GOs .....
## ( 10017 GO terms found. )
##
## Build GO DAG topology ..........
## ( 13905 GO terms and 32890 relations. )
##
## Annotating nodes ...............
## ( 11178 genes annotated to the GO terms. )
save(GOdata.bp, file = "GO_ALD.RData")
resultFisher <- runTest(GOdata.bp, algorithm = "classic", statistic = "fisher")
##
## -- Classic Algorithm --
##
## the algorithm is scoring 7509 nontrivial nodes
## parameters:
## test statistic: fisher
resultKS.weight <- runTest(GOdata.bp, algorithm = 'weight01', statistic = "ks")
##
## -- Weight01 Algorithm --
##
## the algorithm is scoring 7509 nontrivial nodes
## parameters:
## test statistic: ks
## score order: increasing
##
## Level 19: 3 nodes to be scored (0 eliminated genes)
##
## Level 18: 6 nodes to be scored (0 eliminated genes)
##
## Level 17: 10 nodes to be scored (17 eliminated genes)
##
## Level 16: 25 nodes to be scored (48 eliminated genes)
##
## Level 15: 79 nodes to be scored (105 eliminated genes)
##
## Level 14: 164 nodes to be scored (296 eliminated genes)
##
## Level 13: 305 nodes to be scored (826 eliminated genes)
##
## Level 12: 448 nodes to be scored (1711 eliminated genes)
##
## Level 11: 670 nodes to be scored (3337 eliminated genes)
##
## Level 10: 871 nodes to be scored (4737 eliminated genes)
##
## Level 9: 1055 nodes to be scored (6652 eliminated genes)
##
## Level 8: 1064 nodes to be scored (7965 eliminated genes)
##
## Level 7: 1077 nodes to be scored (9007 eliminated genes)
##
## Level 6: 869 nodes to be scored (9770 eliminated genes)
##
## Level 5: 530 nodes to be scored (10311 eliminated genes)
##
## Level 4: 249 nodes to be scored (10679 eliminated genes)
##
## Level 3: 63 nodes to be scored (10832 eliminated genes)
##
## Level 2: 20 nodes to be scored (10961 eliminated genes)
##
## Level 1: 1 nodes to be scored (11019 eliminated genes)
resultKS.elim <- runTest(GOdata.bp, algorithm = "elim", statistic = "ks")
##
## -- Elim Algorithm --
##
## the algorithm is scoring 7509 nontrivial nodes
## parameters:
## test statistic: ks
## cutOff: 0.01
## score order: increasing
##
## Level 19: 3 nodes to be scored (0 eliminated genes)
##
## Level 18: 6 nodes to be scored (0 eliminated genes)
##
## Level 17: 10 nodes to be scored (0 eliminated genes)
##
## Level 16: 25 nodes to be scored (0 eliminated genes)
##
## Level 15: 79 nodes to be scored (0 eliminated genes)
##
## Level 14: 164 nodes to be scored (67 eliminated genes)
##
## Level 13: 305 nodes to be scored (91 eliminated genes)
##
## Level 12: 448 nodes to be scored (365 eliminated genes)
##
## Level 11: 670 nodes to be scored (674 eliminated genes)
##
## Level 10: 871 nodes to be scored (1277 eliminated genes)
##
## Level 9: 1055 nodes to be scored (1913 eliminated genes)
##
## Level 8: 1064 nodes to be scored (2636 eliminated genes)
##
## Level 7: 1077 nodes to be scored (3500 eliminated genes)
##
## Level 6: 869 nodes to be scored (3928 eliminated genes)
##
## Level 5: 530 nodes to be scored (4809 eliminated genes)
##
## Level 4: 249 nodes to be scored (5359 eliminated genes)
##
## Level 3: 63 nodes to be scored (6200 eliminated genes)
##
## Level 2: 20 nodes to be scored (6289 eliminated genes)
##
## Level 1: 1 nodes to be scored (6289 eliminated genes)
allRes <- GenTable(GOdata.bp, classic = resultFisher, weight01 = resultKS.weight,
elim = resultKS.elim, orderBy = "weight01", topNodes = 1000)
allRes <- allRes[allRes$weight01 < 0.05, ]
write.csv(allRes, file = "GO_ALD_progression.csv", row.names = FALSE)
head(allRes)
## GO.ID Term Annotated Significant
## 11 GO:0070125 mitochondrial translational elongation 82 68
## 12 GO:0002576 platelet degranulation 94 80
## 13 GO:0046951 ketone body biosynthetic process 7 7
## 14 GO:0097267 omega-hydroxylase P450 pathway 9 9
## 15 GO:0019439 aromatic compound catabolic process 369 268
## 16 GO:0008202 steroid metabolic process 221 168
## Expected Rank in weight01 classic weight01 elim
## 11 57.55 11 0.00597 0.00014 0.00014
## 12 65.97 12 0.00060 0.00018 0.00018
## 13 4.91 13 0.08380 0.00021 0.00021
## 14 6.32 14 0.04125 0.00021 0.00021
## 15 258.97 15 0.16192 0.00024 0.24177
## 16 155.10 16 0.03093 0.00035 6.6e-05
showSigOfNodes(GOdata.bp, score(resultKS.weight), firstSigNodes = 10, useInfo = 'all')
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 131
## Number of Edges = 228
##
## $complete.dag
## [1] "A graph with 131 nodes."
title(main = "GO analysis using Weight01 algorithm", line = -2)
In this plot we can observe the relationship between the top 10 significant gene ontologies, using the Weight01 algorithm.
Here are the packages and the versions used to analyse these data and build this page:
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices datasets utils
## [8] methods base
##
## other attached packages:
## [1] sva_3.22.0 genefilter_1.56.0 mgcv_1.8-16
## [4] nlme_3.1-128 edgeR_3.16.4 limma_3.30.6
## [7] geneplotter_1.52.0 annotate_1.52.0 XML_3.98-1.5
## [10] AnnotationDbi_1.36.0 IRanges_2.8.1 S4Vectors_0.12.0
## [13] lattice_0.20-34 Biobase_2.34.0 BiocGenerics_0.20.0
## [16] BiocStyle_2.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.8 RColorBrewer_1.1-2 bitops_1.0-6
## [4] tools_3.3.2 digest_0.6.10 RSQLite_1.1
## [7] evaluate_0.10 memoise_1.0.0 Matrix_1.2-7.1
## [10] DBI_0.5-1 yaml_2.1.14 stringr_1.1.0
## [13] knitr_1.15.1 locfit_1.5-9.1 rprojroot_1.1
## [16] grid_3.3.2 survival_2.40-1 rmarkdown_1.2
## [19] magrittr_1.5 backports_1.0.4 codetools_0.2-15
## [22] htmltools_0.3.5 splines_3.3.2 xtable_1.8-2
## [25] stringi_1.1.2 RCurl_1.95-4.8